AI-Powered Citation Auditing: A Zero-Assumption Protocol for Systematic Reference Verification in Academic Research

van Rensburg, L. J. Janse

arXiv.org Artificial Intelligence

Academic citation integrity faces persistent challenges: research indicates that roughly 20% of citations contain errors, and manual verification can require months of expert time. This paper presents a novel AI-powered methodology for systematic, comprehensive reference auditing using agentic AI with tool-use capabilities. We develop a zero-assumption verification protocol that independently validates every reference against multiple academic databases (Semantic Scholar, Google Scholar, CrossRef) without assuming any citation is correct. The methodology was validated across 30 academic documents (2,581 references) spanning undergraduate projects through doctoral theses and peer-reviewed publications. Results demonstrate a 91.7% average verification rate on published PLOS papers, with successful detection of fabricated references, retracted articles, orphan citations, and predatory journals. Time efficiency improved dramatically: a 90-minute audit of a 916-reference doctoral thesis versus months of manual review. The system achieved a <0.5% false positive rate while identifying critical issues that manual review might miss. This work establishes the first validated AI-agent methodology for academic citation integrity, demonstrating practical applicability for supervisors, students, and institutional quality assurance.
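One step in a zero-assumption audit of this kind is deciding whether a cited title actually matches the record returned by a metadata database such as CrossRef. The sketch below shows that matching step only; the function names and the 0.9 similarity threshold are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: fuzzy title matching for reference verification.
# A reference counts as verified only if the cited title closely matches
# the title retrieved from a database; otherwise it is flagged for review.
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(title.lower().split())


def titles_match(cited: str, retrieved: str, threshold: float = 0.9) -> bool:
    """Return True when similarity clears the (illustrative) threshold."""
    ratio = SequenceMatcher(None, normalize(cited), normalize(retrieved)).ratio()
    return ratio >= threshold
```

In practice the retrieved title would come from a database query, and anything below the threshold would be routed to a human auditor rather than rejected outright.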


Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Osian, Sam, Dutta, Arpan, Bhandari, Sahil, Buchan, Iain E., Joyce, Dan W.

arXiv.org Artificial Intelligence

Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports ($\leq$ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open-source "text-to-table" language-model pipeline (PFD Toolkit) could reproduce the ONS's identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit's large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen's $\kappa$ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
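The agreement statistic reported above, Cohen's kappa, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch of the computation (function name and label encoding are illustrative, not taken from the PFD Toolkit):

```python
# Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e), where p_o is
# observed agreement and p_e is chance agreement from marginal label rates.
def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters labelled the same.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal rate per label.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.82, as reported, sits in the conventional "almost perfect" band (above 0.8), whereas a kappa near zero means agreement is no better than chance.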


CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Yin, Wenjing, Sun, Tianze, Yu, Yijiong, Fang, Jiawei, Su, Guangyao, Wang, Jiancheng, Wang, Zekun, Wang, Wei, Chen, Ran, Dai, Ziyun, Yuan, Shuai, Dong, Menghang, Luo, Peng, Cao, Dong, Lei, Da, Zhang, Yajun, Chen, Hao, Ma, Xiang, Liu, Yong, Liu, Weifeng, Xu, Yuanjian, Pei, Ji

arXiv.org Artificial Intelligence

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.


PHEONA: An Evaluation Framework for Large Language Model-based Approaches to Computational Phenotyping

Pungitore, Sarah, Yadav, Shashank, Subbian, Vignesh

arXiv.org Artificial Intelligence

Computational phenotyping is essential for biomedical research but often requires significant time and resources, especially since traditional methods typically involve extensive manual data review. While machine learning and natural language processing advancements have helped, further improvements are needed. Few studies have explored using Large Language Models (LLMs) for these tasks despite known advantages of LLMs for text-based tasks. To facilitate further research in this area, we developed an evaluation framework, Evaluation of PHEnotyping for Observational Health Data (PHEONA), that outlines context-specific considerations. We applied and demonstrated PHEONA on concept classification, a specific task within a broader phenotyping process for Acute Respiratory Failure (ARF) respiratory support therapies. From the sample concepts tested, we achieved high classification accuracy, suggesting the potential for LLM-based methods to improve computational phenotyping processes.


AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County

Surani, Faiz, Suzgun, Mirac, Raman, Vyoma, Manning, Christopher D., Henderson, Peter, Ho, Daniel E.

arXiv.org Artificial Intelligence

Legal reform can be challenging in light of the volume, complexity, and interdependence of laws, codes, and records. One salient example of this challenge is the effort to restrict and remove racially restrictive covenants, clauses in property deeds that historically barred individuals of specific races from purchasing homes. Despite the Supreme Court holding such racial covenants unenforceable in 1948, they persist in property records across the United States. Many jurisdictions have moved to identify and strike these provisions, including California, which mandated in 2021 that all counties implement such a process. Yet the scale can be overwhelming, with Santa Clara County (SCC) alone having over 24 million property deed documents, making purely manual review infeasible. We present a novel approach to addressing this pressing issue, developed through a partnership with the SCC Clerk-Recorder's Office. First, we leverage an open large language model, finetuned to detect racial covenants with high precision and recall. We estimate that this system reduces manual efforts by 86,500 person hours and costs less than 2% of the cost for a comparable off-the-shelf closed model. Second, we illustrate the County's integration of this model into responsible operational practice, including legal review and the creation of a historical registry, and release our model to assist the hundreds of jurisdictions engaged in similar efforts. Finally, our results reveal distinct periods of utilization of racial covenants, sharp geographic clustering, and the disproportionate role of a small number of developers in maintaining housing discrimination. We estimate that by 1950, one in four properties across the County was subject to racial covenants.


More efficient manual review of automatically transcribed tabular data

Pedersen, Bjørn-Richard, Johansen, Rigmor Katrine, Holsbø, Einar, Sommerseth, Hilde, Bongo, Lars Ailo

arXiv.org Artificial Intelligence

Machine learning methods have proven useful in transcribing historical data. However, results from even highly accurate methods require manual verification and correction. Such manual review can be time-consuming and expensive; the objective of this paper was therefore to make it more efficient. Previously, we used machine learning to transcribe 2.3 million handwritten occupation codes from the Norwegian 1950 census with high accuracy (97%). We manually reviewed the 90,000 (3%) codes with the lowest model confidence. We allocated those 90,000 codes to human reviewers, who used our annotation tool to review the codes. To assess reviewer agreement, some codes were assigned to multiple reviewers. We then analyzed the review results to understand the relationship between accuracy improvements and effort. Additionally, we interviewed the reviewers to improve the workflow. The reviewers corrected 62.8% of the labels and agreed with the model label in 31.9% of cases. About 0.2% of the images could not be assigned a label, while for 5.1% the reviewers were uncertain or assigned an invalid label. 9,000 images were independently reviewed by multiple reviewers, resulting in an agreement of 86.43% and disagreement of 8.96%. We learned that our automatic transcription is biased towards the most frequent codes, with a higher degree of misclassification for the lowest-frequency codes. Our interview findings show that the reviewers performed internal quality control and found our custom tool well suited. Thus, only one reviewer is needed, but they should report uncertainty.
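The selection strategy described above, routing only the lowest-confidence fraction of predictions to human reviewers, can be sketched in a few lines. The function name and the 3% default are illustrative, mirroring the paper's setup rather than reproducing its code.

```python
# Hypothetical sketch: pick the lowest-confidence predictions for manual
# review, leaving high-confidence transcriptions untouched.
def select_for_review(confidences, fraction=0.03):
    """Return indices of the lowest-confidence items (at least one)."""
    k = max(1, round(len(confidences) * fraction))
    # Sort indices by ascending confidence; the front of the list is the
    # slice the model is least sure about.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]
```

On 2.3 million codes with fraction=0.03, this yields roughly the 90,000-item review queue the study describes; tuning the fraction trades reviewer effort against residual error.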


Protecting payments in an era of deepfakes and advanced AI

#artificialintelligence

In the midst of unprecedented volumes of e-commerce since 2020, the number of digital payments made every day around the planet has exploded – hitting about $6.6 trillion in value last year, a 40 percent jump in two years. With all that money flowing through the world's payments rails, there's even more reason for cybercriminals to innovate ways to nab it. To help ensure payments security today requires advanced game theory skills to outthink and outmaneuver highly sophisticated criminal networks that are on track to steal up to $10.5 trillion in "booty" via cybersecurity damages, according to a recent Argus Research report. Payment processors around the globe are constantly playing against fraudsters and improving upon "their game" to protect customers' money. The target invariably moves, and scammers become ever more sophisticated.


3 Ways IQ Bot Enables Financial Process Automation

#artificialintelligence

Most researchers agree approximately 80% of any organization's data is hidden in multiple sources, such as emails, PDF application forms, and paper documents. This data often goes unused because of the time and resources required to get meaningful information from it. Even then, the data often needs to be manually rekeyed into multiple locations. Combined with Robotic Process Automation (RPA), IQ Bot enables banking, financial services, and insurance (BFSI) companies to take advantage of intelligent document processing (IDP) to extract valuable data and streamline operations. IQ Bot blends multiple artificial intelligence (AI) technologies, such as computer vision, machine learning, and natural language processing (NLP), to glean relevant information from any type of document, add structure to the data, and deliver the results to multiple applications.


The future of AI in finance is here: Reducing the cost of accuracy

#artificialintelligence

Artificial intelligence and machine learning (AI/ML) have already transformed industries and changed the way work gets done across the enterprise. While finance has traditionally lagged behind other departments in the AI adoption curve, that's starting to change. Adoption of AI in finance is being spurred by digital natives (professionals who grew up in a connected world), with tech solutions finally delivering on the promise of AI/ML. Finance professionals accustomed to modern technology experiences in other areas of their lives are no longer willing to endure painstaking manual reviews and the threat of inaccurate data in their forecasts and plans. Outside of finance, many other areas of the business are well past cutting their teeth when it comes to using AI to improve forecasting and drive decision-making.


eCommerce, Delivery And The Gig Economy Create Opportunities For Both Fraud And The Artificial Intelligence To Detect It

#artificialintelligence

The first area most people think of with fraud is finance. That extends beyond scammers to a wide range of attacks, including those on banking and trading. There has been much discussion of how artificial intelligence (AI) is being used to address wider areas of fraud, such as pharmaceutical prescription fraud. Last year saw phenomenal growth in the use of online marketplaces and delivery services, and fraud in those areas grew along with it.